About the Data and Data Cleaning

Introduction

To explore the use of online sources in health communication, two publicly available datasets were used: HINTS and Reddit Pushshift. The Health Information National Trends Survey (HINTS) collects information about Americans' use of cancer-related information. For this project, eleven questions from the HINTS survey were selected for review. These questions gather information about individuals' behaviors, trust, and perceptions related to cancer information and health communication. Collectively, they provide insight into the sources Americans rely on for health and cancer information, the levels of trust they place in those sources, their experiences searching for such information, and their perceptions of the reliability of online and social media health content. HINTS is used to better understand how the American public looks for health information for themselves or their loved ones.

The Reddit Pushshift dataset includes free-text data from the Reddit online forum, where users can post and look for information. Reddit is organized into subreddits, dedicated communities centered on specific topics. By examining this dataset, the project complements the insights from HINTS, offering a unique perspective on how individuals seek, share, and discuss cancer-related health information online. Analyzing Reddit data allows researchers to explore the dynamic, informal exchanges that occur in digital forums and deepens our understanding of how cancer information is communicated.

HINTS Preparation of the Data

Code
import pyreadr
import pandas as pd
from IPython.display import display

# Load the .rda file
result = pyreadr.read_r('/Users/elizabethkovalchuk/Documents/DSAN6000/Project/fall-2024-project-team-35/data/HINTS6_R_20240524/hints6_public.rda')

# Extract the DataFrame from the loaded data
hints = result['public']  # Assuming 'public' is the name of the R object in the file

# Specify the columns to select
columns = [
    "HHID", "SeekCancerInfo", "CancerFrustrated", "CancerTrustDoctor",
    "CancerTrustFamily", "CancerTrustGov", "CancerTrustCharities",
    "CancerTrustReligiousOrgs", "CancerTrustScientists", "Electronic2_HealthInfo",
    "MisleadingHealthInfo", "TrustHCSystem"
]

# Select the relevant columns
hints_select = hints[columns]


# Preview the first few rows
print("Sample data from the HINTS dataset:")
display(hints_select.head())
print(f"Shape of the original dataset: {hints_select.shape}")
Sample data from the HINTS dataset:
HHID SeekCancerInfo CancerFrustrated CancerTrustDoctor CancerTrustFamily CancerTrustGov CancerTrustCharities CancerTrustReligiousOrgs CancerTrustScientists Electronic2_HealthInfo MisleadingHealthInfo TrustHCSystem
0 21000006 No Inapplicable, coded 2 in SeekCancerInfo A lot Missing data (Not Ascertained) Missing data (Not Ascertained) Missing data (Not Ascertained) Missing data (Not Ascertained) Missing data (Not Ascertained) Question answered in error (Commission Error) I do not use social media Very
1 21000009 No Inapplicable, coded 2 in SeekCancerInfo A lot Some A lot Some Some A lot Yes I do not use social media Very
2 21000020 Yes Somewhat disagree A lot Some Some A little Not at all A lot Yes Some Somewhat
3 21000022 No Inapplicable, coded 2 in SeekCancerInfo A lot Missing data (Not Ascertained) Missing data (Not Ascertained) Missing data (Not Ascertained) Missing data (Not Ascertained) Missing data (Not Ascertained) Inapplicable, coded 2 in UseInternet I do not use social media Somewhat
4 21000039 No Inapplicable, coded 2 in SeekCancerInfo Some Some Some Not at all Not at all Some Yes A lot Somewhat
Shape of the original dataset: (6252, 12)
Code
# Count missing values in each column
missing_values = hints_select.isna().sum()

# Display the count of missing values
print("Missing values per column:")
display(missing_values)
Missing values per column:
HHID                        0
SeekCancerInfo              0
CancerFrustrated            0
CancerTrustDoctor           0
CancerTrustFamily           0
CancerTrustGov              0
CancerTrustCharities        0
CancerTrustReligiousOrgs    0
CancerTrustScientists       0
Electronic2_HealthInfo      0
MisleadingHealthInfo        0
TrustHCSystem               0
dtype: int64
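Note that these zero counts do not mean the survey has no missing responses: HINTS encodes nonresponse as explicit category labels such as "Missing data (Not Ascertained)" rather than as NaN, so `isna()` finds nothing. A minimal sketch with a toy series (not the actual HINTS data) illustrates the distinction:

```python
import pandas as pd

# Toy series mimicking a HINTS column: missingness is a category label, not NaN
s = pd.Series(
    ["A lot", "Missing data (Not Ascertained)", "Some"],
    dtype="category",
)

print(s.isna().sum())  # 0 -- the "missing" label counts as a valid category
print((s == "Missing data (Not Ascertained)").sum())  # 1 -- must be matched by label
```

This is why the filtering step below matches values against explicit lists of valid responses instead of dropping NaN rows.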
Code
# List of ordinal columns
ordinal_columns = [
    "SeekCancerInfo", "CancerFrustrated", "CancerTrustDoctor",
    "CancerTrustFamily", "CancerTrustGov", "CancerTrustCharities",
    "CancerTrustReligiousOrgs", "CancerTrustScientists", "Electronic2_HealthInfo",
    "MisleadingHealthInfo", "TrustHCSystem"
]

# Display unique values for each ordinal column
print("Unique values for ordinal columns:")
for column in ordinal_columns:
    unique_values = hints_select[column].unique()
    print(f"\nColumn: {column}")
    print(f"Unique Values: {unique_values}")
Unique values for ordinal columns:

Column: SeekCancerInfo
Unique Values: ['No', 'Yes', 'Missing data (Not Ascertained)']
Categories (3, object): ['Missing data (Not Ascertained)', 'No', 'Yes']

Column: CancerFrustrated
Unique Values: ['Inapplicable, coded 2 in SeekCancerInfo', 'Somewhat disagree', 'Strongly disagree', 'Somewhat agree', 'Strongly agree', 'Question answered in error (Commission Error)', 'Missing data (Filter Missing)', 'Missing data (Not Ascertained)', 'Multiple responses selected in error']
Categories (9, object): ['Inapplicable, coded 2 in SeekCancerInfo', 'Missing data (Filter Missing)', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', ..., 'Somewhat agree', 'Somewhat disagree', 'Strongly agree', 'Strongly disagree']

Column: CancerTrustDoctor
Unique Values: ['A lot', 'Some', 'Not at all', 'A little', 'Missing data (Not Ascertained)', 'Multiple responses selected in error']
Categories (6, object): ['A little', 'A lot', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', 'Not at all', 'Some']

Column: CancerTrustFamily
Unique Values: ['Missing data (Not Ascertained)', 'Some', 'A little', 'Not at all', 'A lot', 'Multiple responses selected in error']
Categories (6, object): ['A little', 'A lot', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', 'Not at all', 'Some']

Column: CancerTrustGov
Unique Values: ['Missing data (Not Ascertained)', 'A lot', 'Some', 'A little', 'Not at all', 'Multiple responses selected in error']
Categories (6, object): ['A little', 'A lot', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', 'Not at all', 'Some']

Column: CancerTrustCharities
Unique Values: ['Missing data (Not Ascertained)', 'Some', 'A little', 'Not at all', 'A lot', 'Multiple responses selected in error']
Categories (6, object): ['A little', 'A lot', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', 'Not at all', 'Some']

Column: CancerTrustReligiousOrgs
Unique Values: ['Missing data (Not Ascertained)', 'Some', 'Not at all', 'A little', 'A lot', 'Multiple responses selected in error']
Categories (6, object): ['A little', 'A lot', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', 'Not at all', 'Some']

Column: CancerTrustScientists
Unique Values: ['Missing data (Not Ascertained)', 'A lot', 'Some', 'A little', 'Not at all', 'Multiple responses selected in error']
Categories (6, object): ['A little', 'A lot', 'Missing data (Not Ascertained)', 'Multiple responses selected in error', 'Not at all', 'Some']

Column: Electronic2_HealthInfo
Unique Values: ['Question answered in error (Commission Error)', 'Yes', 'Inapplicable, coded 2 in UseInternet', 'No', 'Missing data (Not Ascertained)', 'Missing data (Filter Missing)']
Categories (6, object): ['Inapplicable, coded 2 in UseInternet', 'Missing data (Filter Missing)', 'Missing data (Not Ascertained)', 'No', 'Question answered in error (Commission Error)', 'Yes']

Column: MisleadingHealthInfo
Unique Values: ['I do not use social media', 'Some', 'A lot', 'A little', 'None', 'Missing data (Not Ascertained)', 'Missing data (Web partial - Question Never Se...]
Categories (7, object): ['A little', 'A lot', 'I do not use social media', 'Missing data (Not Ascertained)', 'Missing data (Web partial - Question Never Se..., 'None', 'Some']

Column: TrustHCSystem
Unique Values: ['Very', 'Somewhat', 'A little', 'Not at all', 'Missing data (Web partial - Question Never Se..., 'Missing data (Not Ascertained)', 'Multiple responses selected in error']
Categories (7, object): ['A little', 'Missing data (Not Ascertained)', 'Missing data (Web partial - Question Never Se..., 'Multiple responses selected in error', 'Not at all', 'Somewhat', 'Very']
Code
# Define the valid scales for each column
valid_scales = {
    "CancerFrustrated": ['Somewhat disagree', 'Strongly disagree', 'Somewhat agree', 'Strongly agree'],
    "CancerTrustDoctor": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustFamily": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustGov": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustCharities": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustReligiousOrgs": ['A lot', 'Some', 'Not at all', 'A little'],
    "CancerTrustScientists": ['A lot', 'Some', 'Not at all', 'A little'],
    "TrustHCSystem": ['A lot', 'Some', 'Not at all', 'A little'],
    "Electronic2_HealthInfo": ['Yes', 'No'], 
    "MisleadingHealthInfo": ['I do not use social media', 'None', 'A little', 'Some', 'A lot']  
}

# Create a copy of the original DataFrame
hints_cleaned = hints_select.copy()

# Filter the DataFrame
for column, scale in valid_scales.items():
    hints_cleaned = hints_cleaned[hints_cleaned[column].isin(scale)]

# Display the cleaned dataset and its shape
print("Data after filtering invalid values:")
display(hints_cleaned.head())
print(f"Shape of the cleaned dataset: {hints_cleaned.shape}")
Data after filtering invalid values:
HHID SeekCancerInfo CancerFrustrated CancerTrustDoctor CancerTrustFamily CancerTrustGov CancerTrustCharities CancerTrustReligiousOrgs CancerTrustScientists Electronic2_HealthInfo MisleadingHealthInfo TrustHCSystem
51 21000330 Yes Somewhat disagree Some Not at all Some Some Not at all A lot Yes A lot A little
112 21000976 Yes Somewhat agree A lot Some Some Some Some A lot Yes Some A little
136 21001112 Yes Somewhat disagree A little A little Not at all Not at all Not at all A little No A lot Not at all
157 21001283 Yes Somewhat disagree A lot Some Not at all A little Some Not at all No I do not use social media Not at all
181 21001548 Yes Strongly agree A lot Some Not at all Some A lot A little Yes Some A little
Shape of the cleaned dataset: (323, 12)
Code
# Count missing values remaining in each column
na_count = hints_cleaned.isna().sum()

# Count unique values in the 'SeekCancerInfo' column
value_counts = hints_cleaned['SeekCancerInfo'].value_counts()
print("Unique value counts in 'SeekCancerInfo':")
print(value_counts)

# Save the cleaned dataset to a CSV file
output_file = "../data/csv/hints_cleaned_forML_spearman.csv"
hints_cleaned.to_csv(output_file, index=False)

print(f"Cleaned dataset saved as {output_file}")
Unique value counts in 'SeekCancerInfo':
SeekCancerInfo
Yes                               323
Missing data (Not Ascertained)      0
No                                  0
Name: count, dtype: int64
Cleaned dataset saved as ../data/csv/hints_cleaned_forML_spearman.csv

Reddit Preparation of the Data

The data was queried from the Reddit Pushshift dataset. Following the themes captured in the HINTS dataset, we performed an initial eight queries searching for comments that included keywords drawn from each of the HINTS questions. The initial query was run in AWS on a sample of the data. After reviewing a selection of the returned comments, we compiled the full set of unique subreddits they came from. Searching through these subreddits, we kept those that actually included comments about cancer and filtered out any that were not relevant to health.
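The queries themselves are not reproduced here, but their core matching logic can be sketched as follows. The keywords and comments below are hypothetical, chosen only to illustrate the regex-based keyword filter; the actual queries ran over the Pushshift sample in AWS:

```python
import re
import pandas as pd

# Hypothetical keywords loosely tied to the HINTS themes (illustrative only)
keywords = ["cancer", "oncologist", "chemo", "tumor", "biopsy"]
pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

# Toy comment data standing in for the Pushshift sample
comments = pd.DataFrame({
    "subreddit": ["breastcancer", "politics", "AskDocs"],
    "body": [
        "Just finished my last chemo round!",
        "The election results are in.",
        "Should I worry about this biopsy result?",
    ],
})

# Keep only comments whose body mentions at least one keyword
matches = comments[comments["body"].str.contains(pattern)]
print(matches["subreddit"].tolist())  # ['breastcancer', 'AskDocs']
```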

List of subreddits whose comments discussed cancer:

subreddit_list = ['CrohnsDisease', 'thyroidcancer', 'AskDocs', 'UlcerativeColitis', 'Autoimmune',
                  'BladderCancer', 'breastcancer', 'CancerFamilySupport', 'doihavebreastcancer',
                  'WomensHealth', 'ProstateCancer', 'cll', 'Microbiome', 'predental', 'endometrialcancer',
                  'cancer', 'Hashimotos', 'coloncancer', 'PreCervicalCancer', 'lymphoma', 'Lymphedema',
                  'CancerCaregivers', 'braincancer', 'lynchsyndrome', 'nursing', 'testicularcancer', 'leukemia',
                  'publichealth', 'Health', 'Fuckcancer', 'HealthInsurance', 'BRCA', 'Cancersurvivors',
                  'pancreaticcancer', 'skincancer', 'stomachcancer']

These subreddits were compared to a random sample from the full Reddit dataset excluding the list of cancer subreddits above.

Queries were conducted in Azure ML using Spark, with the data sourced from the instructor’s Azure Blob container. Comments from both cancer-related and non-cancer subreddits were processed using an Azure ML job and saved as Parquet files in an Azure Blob container. The job applied a filter to separate cancer subreddits into one Parquet file and non-cancer subreddits into another. For the non-cancer subreddits, the data was randomized before filtering out the cancer-related subreddits.
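The actual job ran in Spark, but its split logic can be sketched with an equivalent pandas example. The subreddit subset and rows below are illustrative, not project data:

```python
import pandas as pd

# Illustrative subset of the project's cancer subreddit list
cancer_subreddits = ["breastcancer", "AskDocs", "lymphoma"]

# Toy comment data standing in for the full Reddit dataset
comments = pd.DataFrame({
    "subreddit": ["breastcancer", "politics", "AskDocs", "gaming", "lymphoma"],
    "body": ["a", "b", "c", "d", "e"],
})

# Cancer subreddits go into one output, everything else into another
is_cancer = comments["subreddit"].isin(cancer_subreddits)
cancer_df = comments[is_cancer]

# Non-cancer side: shuffle first, then take a fixed-size sample
not_cancer_df = comments[~is_cancer].sample(frac=1, random_state=42).head(2)

print(sorted(cancer_df["subreddit"]))  # ['AskDocs', 'breastcancer', 'lymphoma']
print(len(not_cancer_df))              # 2
```

In the actual job, each side was written out with Spark's Parquet writer rather than kept in memory.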

Code
# Path to the Azure ML Blob Container
workspace_default_storage_account = "projectgstoragedfb938a3e"
workspace_default_container = "azureml-blobstore-becc8696-e562-432e-af12-8a5e3e1f9b0f"
workspace_wasbs_base_url = f"wasbs://{workspace_default_container}@{workspace_default_storage_account}.blob.core.windows.net/"

comments_path = "cancer/comments"
submissions_path = "cancer/submissions"

PySpark was used to clean the data: leading and trailing whitespace was stripped, punctuation was removed (using regex), underscores were removed, and text was converted to lowercase. Both subsets of the data were limited to 10,000 rows to keep compute time reasonable for each job. After cleaning, the data was saved as two Parquet files in an Azure ML Blob container for use in the rest of the project. The combined cancer and non-cancer subreddit subsets totaled 20,000 rows.
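The cleaning steps can be sketched for a single string; the actual job applied the same transformations column-wise in PySpark. This is an illustrative sketch, not the project's exact code:

```python
import re

def clean_comment(text: str) -> str:
    """Apply the cleaning steps described above to a single comment."""
    text = text.strip()                  # remove leading/trailing whitespace
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation via regex
    text = text.replace("_", "")         # remove underscores (\w keeps them)
    return text.lower()                  # convert to lowercase

print(clean_comment("  My doctor said: DON'T panic!!  "))  # my doctor said dont panic
```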

Code
# Cancer subset of the Reddit data, saved to an Azure ML Blob container
cancer_output_path = f"{workspace_wasbs_base_url}cancer_subreddit.parquet"

# Non-cancer subset of the Reddit data, saved to an Azure ML Blob container
not_cancer_output_path = f"{workspace_wasbs_base_url}not_cancer_subreddit.parquet"

The source code for cleaning the Reddit data is in GitHub: fall-2024-project-team-35/code/spark-job-sample-data